Measuring sustainable tourism with online platform data

Sustainability in tourism is a topic of global relevance, finding multiple mentions in the United Nations Sustainable Development Goals. The complex task of balancing tourism’s economic, environmental, and social effects requires detailed and up-to-date data. This paper investigates whether online platform data can be employed as an alternative data source in sustainable tourism statistics. Using a web-scraped dataset from a large online tourism platform, a sustainability label for accommodations can be predicted reasonably well with machine learning techniques. The algorithmic prediction of accommodations’ sustainability using online data can provide a cost-effective and accurate measure that allows tracking developments in tourism sustainability across the globe with high spatial and temporal granularity.

Supplementary Information: The online version contains supplementary material available at 10.1140/epjds/s13688-022-00354-6.


Supplementary Information
characteristics. It is used to create three variables for the final dataset, namely the total number of words in the section and two binary variables indicating the occurrence of the words "renewable" and "sustainable" in the text. The TripAdvisor website uses customer ratings to create local rankings, which are displayed on each accommodation's page in the format "#X out of Y hotels in area". The raw variables Ranking in area and Total in area are used to create a score between 0 (included) and 1 (excluded) indicating the relative rating of a listing compared to other listings in the area. This relative score is included in the final dataset, while the initial variables are discarded. In the above section, listings are further classified as Hotel, B&B/Inn or Specialty Lodging. This categorical variable is encoded into two binary variables in the final dataset. The variables Specialty Lodging and B&B/Inn are assigned the value 1 if they apply to a listing and 0 if not. This leaves the accommodation type Hotel as the base category, which is not one-hot encoded, avoiding perfect multicollinearity.
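As an illustration, the sketch below parses the ranking string and maps it to a score in the described range. The exact formula used in the paper is not stated here, so the mapping (Y − X)/Y (rank X of Y; the best rank maps close to 1, the worst rank to 0) is an assumption, as are the function and variable names.

```python
# Hypothetical sketch: derive the relative ranking score from the raw string.
import re

def relative_score(ranking_text: str) -> float:
    """Parse a string like '#3 of 120 hotels in Berlin' into a score in [0, 1)."""
    match = re.search(r"#?(\d+)\s+(?:out of|of)\s+(\d+)", ranking_text)
    if not match:
        raise ValueError(f"Unrecognized ranking format: {ranking_text!r}")
    rank, total = int(match.group(1)), int(match.group(2))
    # Assumed mapping: rank == total gives 0 (included), rank == 1 gives
    # (total - 1) / total, so 1 itself is excluded, as described above.
    return (total - rank) / total

print(relative_score("#3 of 120 hotels in Berlin"))   # 0.975
```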
In a listing's review section, TripAdvisor publishes summary data about the distribution of its ratings and lets users select comments in their languages of interest. The language subsection further shows up to four common review languages for the listing and the number of reviews written in each language. From this, nine raw variables can be retrieved: the total number of comments, the most prevalent review languages, and the associated number of reviews in each language. The data is used to create two variables for the final dataset, the proportion of local language reviews and the proportion of English language reviews. For countries with multiple official languages, the sum of all reviews in the official languages is divided by the total number of reviews to calculate the proportion of local language reviews.
For countries with a single official language, the number of reviews in the local language is divided by the total number of reviews. The proportion of English language reviews follows the same logic.
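A minimal sketch of how these two proportion variables might be derived with pandas; the column names, the example listings and the official-language lookup are hypothetical.

```python
import pandas as pd

# Hypothetical scraped inputs: total reviews and per-language review counts.
listings = pd.DataFrame({
    "total_reviews": [200, 50],
    "reviews_by_language": [{"German": 120, "English": 60, "French": 20},
                            {"English": 45, "Dutch": 5}],
    "country": ["Germany", "Belgium"],
})
# Assumed lookup of official languages per country.
official = {"Germany": ["German"], "Belgium": ["Dutch", "French", "German"]}

def proportion(langs, counts, total):
    # Sum reviews over all official languages, then divide by the total.
    return sum(counts.get(lang, 0) for lang in langs) / total

listings["prop_local"] = [
    proportion(official[c], r, t)
    for c, r, t in zip(listings["country"], listings["reviews_by_language"],
                       listings["total_reviews"])
]
listings["prop_english"] = [
    r.get("English", 0) / t
    for r, t in zip(listings["reviews_by_language"], listings["total_reviews"])
]
print(listings[["prop_local", "prop_english"]])
```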
An additional way for TripAdvisor page owners to inform potential customers about their accommodation is through the selection of available property amenities and room features. To the user, only facilities and features available at the accommodation are visible in the listing. Page owners can select from over 340 amenities and over 300 room features. From a small mixed sample of accommodations with and without the GreenLeader award, 34 amenities and 20 room features that differed most strongly in their prevalence between the two groups were selected. During the scraping of each listing, the two abovementioned sections are scanned for the selected features and amenities. From this, 54 one-hot encoded binary variables are created, with value 1 if the feature or amenity is available at the location and 0 if not. The same principle is used to control for the availability of seven different room types. The total number of available property amenities and room features is further included in the final dataset through variables recording the length of each section, irrespective of the exact facilities available.

(Excerpt of amenity variables: "Free Parking", "Fitness", "Bar", "Children Entertainment", "Pets", "Sauna", "Hot Tub", "Pool", "Coffee Shop", "Special Diet", "Breakfast", "Breakfast Room", "Taxi service", "Airport shuttle", "Business center".)

Most ordinal variables that follow a roughly normal distribution are ratings. These include the externally determined hotel class ranging from one to five stars as well as user-generated ratings for location, cleanliness, service, value and overall quality in the same range.

The last category of distributions with similar characteristics comprises proportions and capped variables. For the proportion of reviews in the local language(s) and the proportion of English-language reviews, two distributional centers are found. The proportion of local reviews peaks for values close to zero and values close to one. Similarly, reviews in the English language make up a small portion of total reviews for many listings but close to 100% of reviews for another large group. The number of room tips follows a similar distribution because of the way TripAdvisor displays the variable. For values up to 100, the true number of tips is displayed. For listings with more tips, the displayed number is capped and set equal to 100 on the listing's page. As a result, the variable distribution has a large peak for very low numbers and a secondary peak at the maximum possible value. As with the abovementioned variables, any transformation of the proportion and capped variables will be evaluated on its ability to improve model prediction.

Correlation between variables
GreenLeader Badge and numeric variables: Principal Component Analysis (PCA) is a dimensionality-reduction technique that can be used both in the exploratory stage of data-based research and as a preparatory step in the modelling process. The underlying idea is that a small number of well-chosen dimensions may be able to carry a large portion of the available information while simultaneously avoiding issues caused by high dimensionality of the original data. PCA aims to do this by finding the directions along which observations show the highest variance. An alternative interpretation is that the first principal component (PC) is the line which fits as closely as possible to the available data.
When principal component analysis is applied as a preparatory step for the creation of a model, a small number of dimensions is not only assumed to explain a large proportion of the variation in the data but also to have a strong relationship with the response variable. Instead of using the original input variables as predictors, the principal components take their place. If the abovementioned assumptions hold, this will lead to better generalization results. This is possible because most of the information will be retained while the reduced number of estimated coefficients can aid in the mitigation of overfitting. The transformation is most effective for increasing a classifier's prediction capabilities when the first few principal components capture most of the variation in the data (James et al., 2013). When PCA is used as an exploratory tool, insights can be gained from the analysis of correlations. The correlation between a principal component and a given variable can be interpreted as shared information.
By design, the sum of squared correlation coefficients between a given variable and all PCs is equal to one. Accordingly, the correlations can give an overview of the variables' relative influence on each dimension and inform which variables affect the most important dimensions (Abdi & Williams, 2010).
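The following sketch illustrates this property with scikit-learn; the feature matrix is a random stand-in for the paper's numeric variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature matrix (rows: listings, columns: variables).
X = np.random.default_rng(0).normal(size=(500, 10))

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
scores = pca.transform(X_std)   # principal component scores per observation

# Correlation of each original variable with each principal component.
corr = np.array([[np.corrcoef(X_std[:, j], scores[:, k])[0, 1]
                  for k in range(scores.shape[1])]
                 for j in range(X_std.shape[1])])

# By design, each variable's squared correlations over all PCs sum to one.
print(np.allclose((corr ** 2).sum(axis=1), 1.0))
```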

Sampling Methods
To prevent classifiers from learning that minority class observations are scarce and therefore unimportant, the analysis of an imbalanced dataset requires extra steps. A range of methods is available, both for the transformation of the dataset and for the modification of the classifiers.
Sampling methods are part of the first group. They focus on changing the composition of the training data, aiming for a balance of class frequencies. The name sampling stems from the resampling of an already existing dataset. Sampling methods ignore possible causes of imbalance in the domain specific to the problem. Instead, the issue of relative class imbalance is solved in a widely applicable manner. This can be of great benefit to the later analysis as it allows for the application of standard, unmodified machine learning algorithms to imbalanced domains (He & Ma, 2013). It is important to note that sampling methods are only applied to the training data. In order to get a fair understanding of classifier performance, class proportions in the holdout data remain representative of the full sample. In the following, random under- and oversampling and the synthetic minority oversampling technique will be introduced.
Random data sampling or naïve sampling techniques offer the easiest implementation and execution of the discussed methods. No assumptions about the data are made, and the fast execution is a desirable feature for large datasets. The methods work through the deletion or duplication of existing observations. Random undersampling balances class frequencies through the deletion of majority class observations. As no additional examples are added, the method is more suitable for relative class imbalance with a sufficient absolute number of minority class observations. Since observations are selected randomly for deletion, no mechanism is in place to preserve examples especially rich in information. This can lead to the deletion of large portions of useful data. Random oversampling, on the other hand, selects and duplicates examples from the minority class and adds the resulting observations to the training set. The selection is performed randomly and with replacement, so that original observations may be duplicated multiple times. Accordingly, random oversampling is effective when multiple realizations with the same properties can affect the classifier. This is the case for models using iterative optimization and for classifiers using data splits, such as decision trees (Brownlee, 2020). With strong imbalances, random oversampling can cause a model to overfit minority class observations. This risk should be kept in mind to avoid the loss of generalization ability.
While random oversampling manages to balance class frequencies without loss of data, it provides no additional information. The Synthetic Minority Oversampling Technique (SMOTE) improves upon the simple duplication of observations by creating new, unique observations on the basis of existing data. This is done by first randomly selecting an example from the minority class and determining its k nearest neighbours, where k is a hyperparameter and typically small. In the next step, a single observation is randomly selected from the set of neighbours and a line between the original observation and the selected neighbour is drawn in feature space. In the final step, a new, synthetic observation is sampled at a point along the line (Brownlee, 2020). While advantageous in many situations, adding synthetic observations to the dataset also brings risks. Since new examples are created without consideration of the majority class, SMOTE may create ambiguous observations in areas where classes overlap strongly. Finally, it is worth considering the application of multiple sampling techniques to the same set of data. Employing both an under- and an oversampling method has been shown to achieve better results than focusing on one technique (Chawla et al., 2002).
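A minimal sketch of the three sampling methods, and their combination, using the imbalanced-learn package; the feature matrix X and the binary label y (1 for the minority class) are hypothetical stand-ins for the paper's training data.

```python
import numpy as np
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.05).astype(int)   # roughly 5% minority class

# Applied to training data only; holdout proportions stay representative.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Combining over- and undersampling, in the spirit of Chawla et al. (2002):
# SMOTE raises the minority class to half the majority size, then random
# undersampling shrinks the majority class to match.
combo = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, random_state=0)),
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
])
X_combo, y_combo = combo.fit_resample(X, y)
print(np.bincount(y), np.bincount(y_combo))
```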

Transformations
The difficulty of a modelling problem may be increased by differences in scale and distribution of the included variables. A large proportion of popular algorithms performs better when all numerical inputs are scaled to the same range. Algorithms affected by this include those using a weighted sum of the input to predict outputs, such as linear regression, logistic regression, and artificial neural networks. The second group of algorithms requiring scaled data for good performance works on the basis of distance measures. This group includes the k-nearest neighbors algorithm and support vector machines. However, not all approaches are affected.
Tree-based methods such as decision trees and random forests do not require scaled variables (Brownlee, 2016). In the following, data transformations through normalization, standardization and robust standardization will be introduced. The section then moves to generalized power transformations, which aim to make variable distributions more Gaussian.

Standardization assumes that variable distributions can be fit to a Gaussian distribution. This includes having a well-behaved mean and standard deviation. It is possible to standardize variables not fulfilling these assumptions, but the achieved results may not be reliable. Outliers also pose a risk to standardization, as they tend to skew the mean, the standard deviation or both. A solution to this is robust scaling, which relies on the median and interquartile range (IQR) instead of the mean and standard deviation. Formally, $x' = (x - \mathrm{median})/\mathrm{IQR}$, where the median is the middle value of the sample distribution. The interquartile range describes the difference between the 75th and the 25th percentile of the distribution. Similar to standardization, robust scaling centres the output distribution at zero, with the median rather than the mean mapped to zero. Unlike standardization however, outliers will neither skew the output nor lose their relative relationship to other datapoints (Brownlee, 2016). Which scaler performs best in terms of classifier performance will not be clear from the beginning (Brownlee, 2016).

The abovementioned techniques aim at scaling data so that variable distributions fall into the same range and fulfil simple model assumptions. Some statistical learning techniques, such as logistic regression and the Naïve Bayes classifier, extend assumptions to include a Gaussian distribution of input variables. A second group of models does not explicitly require normally distributed data but tends to perform better in its presence (Brownlee, 2016). It is therefore worthwhile to review power transformations, which employ logarithmic and exponential transformations to remove variable skew and stabilize variance. Such transformations can be achieved via the direct application of the logarithm, square root or other functions to a given variable. While quickly realized, this may not always lead to the use of the optimal power transform for each variable. Instead, generalized approaches, which find the optimal transformation through parameter tuning, have been developed. This paper employs two such approaches, the Box-Cox and the Yeo-Johnson transformations.
Depending on the value of the estimated parameter $\lambda$, a reciprocal transform ($\lambda = -1$), a log transform ($\lambda = 0$), a square root transform ($\lambda = 1/2$), a range of combinations, or no transformation at all ($\lambda = 1$) can be employed. Optimal values found in the training data are saved and later applied to new data. To guarantee the strictly positive input data required by the Box-Cox transformation, a preprocessing step scaling all variables to the range between one and two is used beforehand. The Yeo-Johnson transformation follows a similar principle but does not require this pre-processing, as it also allows for zero and negative input values.
It is formally given by

$$
x_i^{(\lambda)} =
\begin{cases}
\left((x_i + 1)^{\lambda} - 1\right)/\lambda & \text{if } \lambda \neq 0,\ x_i \geq 0, \\
\log(x_i + 1) & \text{if } \lambda = 0,\ x_i \geq 0, \\
-\left((1 - x_i)^{2-\lambda} - 1\right)/(2 - \lambda) & \text{if } \lambda \neq 2,\ x_i < 0, \\
-\log(1 - x_i) & \text{if } \lambda = 2,\ x_i < 0,
\end{cases}
$$

where the optimal parameter $\lambda$ is once again estimated using the training data (Yeo & Johnson, 2000). As the improvement of prediction quality is the core goal of this paper, all data transformations and their combinations will be evaluated by their ability to improve final model output.
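A minimal sketch of both transforms with scikit-learn's PowerTransformer, including the [1, 2] pre-scaling step described above; the input matrix is a hypothetical skewed stand-in.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, PowerTransformer
from sklearn.pipeline import Pipeline

X = np.random.default_rng(1).exponential(size=(200, 3))   # skewed inputs

# Box-Cox requires strictly positive data, hence the [1, 2] pre-scaling step.
box_cox = Pipeline([
    ("scale", MinMaxScaler(feature_range=(1, 2))),
    ("power", PowerTransformer(method="box-cox")),
])
X_bc = box_cox.fit_transform(X)

# Yeo-Johnson accepts zero and negative values, so no pre-scaling is needed.
X_yj = PowerTransformer(method="yeo-johnson").fit_transform(X)

# Optimal lambdas are estimated on the training data and reused on new data.
print(box_cox.named_steps["power"].lambdas_)
```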

Supplementary Information VII: Description of classifiers used in grid-search
Logistic Regression
The first considered statistical modelling technique is logistic regression. It is a supervised learning technique, meaning that group membership is known in the training data. Logistic regression has two complementary objectives: classification and explanation. The final evaluation and comparison to other techniques will be performed on the basis of classification performance.
A binary classification could be approached through a linear regression model if the response variable is coded to values 0 and 1 using the dummy variable approach. In the case of this paper, the categorical outcome GreenLeader would be coded to the numerical value 1 and not GreenLeader would be coded to 0.
Modelling the binary response with least squares regression like this is not without merit. It provides a simple solution to the problem, and $X\hat{\beta}$ is indeed an estimate of the probability that the regressand takes the value 1 (James et al., 2013). However, this approach violates two assumptions of linear regression that cannot be remedied through transformation of the variables. Firstly, the error term of the discrete variable does not follow the normal distribution, rendering statistical testing based on the assumption of normality invalid. Secondly, the variance of a Bernoulli random variable is a function of $X_1, \dots, X_p$, and therefore heteroscedastic by definition (Hair et al., 2019). To avoid these shortcomings, an alternative model needs to be able to accommodate the binomial distribution and heteroscedasticity. In addition, the function used in modelling the probability should return outputs between 0 and 1 for all input values. Logistic regression fulfils these requirements.
In its standard form, the underlying logistic function is expressed as

$$f(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{1 + e^{x}}.$$

Filling the linear regression model into the logistic function, an example model using a single regressor is given by

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}.$$

The calculation of the probability across a range of input values makes clear that it lies between 0 and 1 for every input level. The function returns an S-shaped curve, which gets close to but never becomes 0; the output value approaches 1 with increased input values but never fully reaches it (James et al., 2013).
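A small numerical illustration of these properties; the coefficients beta0 and beta1 are arbitrary example values.

```python
import numpy as np

def logistic(x):
    """Standard logistic function: 1 / (1 + e^(-x)) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + np.exp(-x))

beta0, beta1 = -2.0, 0.8          # assumed example coefficients
x = np.linspace(-10, 10, 5)
p = logistic(beta0 + beta1 * x)   # probabilities strictly between 0 and 1

print(p)                                  # S-shaped: near 0 and 1 at the extremes
print((0 < p).all() and (p < 1).all())    # True: bounds are never reached
```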

Linear Discriminant Analysis
The second modelling technique considered for the classification of observations is linear discriminant analysis (LDA). Like logistic regression, LDA is both a supervised learning technique and a classification method. Accordingly, a model is trained with labelled datapoints in order to later correctly assign unlabeled observations to the available classes.
While logistic regression directly models the conditional distribution of the response class given the explanatory variables, LDA follows a less direct approach. The technique makes use of differences in the distribution of the predictor variables between the response groups. This follows the intuition that the distribution of $X$ given $Y$ can offer information about the distribution of $Y$ given $X$ (James et al., 2013). The relevant theorem by Bayes states

$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$

where $\pi_k$ denotes the prior probability of class $k$ and $f_k(x)$ the density of $X$ for observations from class $k$. LDA assumes that $X = (X_1, X_2, \dots, X_p)$ is drawn from a multivariate normal or Gaussian distribution. In addition, LDA assumes a common covariance matrix $\Sigma_k = \Sigma$ for all $k$ and class-specific mean vectors $\mu_k$. Formally, the multivariate Gaussian density is denoted as

$$f(x) = \frac{1}{(2\pi)^{p/2} \lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right).$$

After plugging all available information into Bayes' Theorem and rearranging, it can be shown that the classifier assigns the observation $x$ to the class for which

$$\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k$$

returns the largest value. As suggested by the term linear discriminant analysis, the classification rule depends on $x$ through a linear combination of the observation's elements.
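A minimal numpy sketch of this decision rule; the class means, shared covariance matrix and priors are hypothetical example values.

```python
import numpy as np

mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class mean vectors
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])          # common covariance matrix
pi = [0.9, 0.1]                                     # class priors

Sigma_inv = np.linalg.inv(Sigma)

def lda_discriminant(x, k):
    """Linear discriminant: x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log pi_k."""
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])

x = np.array([1.5, 0.5])
scores = [lda_discriminant(x, k) for k in range(2)]
print(np.argmax(scores))   # x is assigned to the class with the largest score
```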

Quadratic Discriminant Analysis
The third classifier considered is quadratic discriminant analysis (QDA). Like LDA, QDA is based on the assumption that the observations in a given class are drawn from the Gaussian distribution. QDA also makes use of Bayes' Theorem, again resembling the LDA procedure.
Unlike linear discriminant analysis however, quadratic discriminant analysis does not assume a common covariance matrix for observations of all classes and instead allows for different covariance matrices for each class. Formally, a given observation will be of the form $X \sim N(\mu_k, \Sigma_k)$, where both the mean vector and the covariance matrix are class-dependent (James et al., 2013). This change of assumptions has a significant effect on the complexity of the classifier's decision rule. Assuming class-dependent covariance matrices, an observation is assigned to the class for which the following is largest (Friedman et al., 2001):

$$\delta_k(x) = -\frac{1}{2} \log \lvert\Sigma_k\rvert - \frac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) + \log \pi_k.$$
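A minimal sketch fitting both discriminant classifiers with scikit-learn on synthetic data in which the two classes genuinely differ in covariance; X and y are stand-ins for the paper's data.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(7)
# Class 1 has a larger spread, so the covariance matrices truly differ.
X = np.vstack([rng.normal(0, 1, size=(450, 4)),
               rng.normal(1, 2, size=(50, 4))])
y = np.array([0] * 450 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)        # shared covariance assumed
qda = QuadraticDiscriminantAnalysis().fit(X, y)     # class-specific covariances

# QDA's quadratic boundary can pay off when covariances differ by class.
print(lda.score(X, y), qda.score(X, y))
```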

Random Forest Classifier
The final method included in the analysis is the random forest classifier. It is an ensemble method and an algorithmic modelling approach. As such, the method fundamentally differs from the methods introduced earlier. Tree-based methods work through the segmentation of the predictor space. This segmentation is achieved through a stepwise process based on previously determined criteria. In the case of classification trees, the optimal next cutting point is determined based on measures of impurity.
Low variance is a desirable attribute for an estimation procedure because similar results can be achieved even when using distinct data sets. Decision trees alone are not able to perform well in this regard but can be improved upon by bootstrap aggregation or bagging. To understand the advantages of the additional effort, it is useful to recall that, given $n$ independent observations with variance $\sigma^2$, the mean of these observations has variance $\sigma^2/n$. Accordingly, a reduction in variance can be achieved through averaging over observations. In the regression context, this translates to using multiple training sets to build separate models and ultimately averaging over them. In classification, the same effect can be achieved by letting multiple models participate in a majority vote to determine the final class estimate. The advantages of this process can be likened to the wisdom-of-crowds concept, supposing that collective knowledge exceeds individual knowledge given sufficient diversity of the crowd (James et al., 2013). In order to perform this method using a single training set, repeated subsamples composed of different combinations of the original observations are taken. This process is called bootstrapping. Finally, the combination of bootstrapped sampling and moving ahead with the average or majority vote prediction is known as bootstrap aggregation or bagging.
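A minimal sketch of bagging with classification trees using scikit-learn (the estimator parameter name assumes a recent scikit-learn version); X and y are synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 6))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=600) > 1).astype(int)

# Each tree is trained on a bootstrap sample of the training set; the final
# class estimate is a majority vote over all trees.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```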
Any correlation between decision trees limits the effect of bootstrap aggregation. The principle of the random forest method is to improve the reduction of variance from bagging by decorrelating trees (Hastie et al., 2009). This is achieved by considering only a random subset of the predictors as split candidates at each node, so that individual strong predictors cannot dominate every tree.

Accordingly, the number of dimensions to be used is decided using guidelines. The following figure visualizes three such guidelines. The scree plot on the left visualizes the proportion of total variance the first 20 components capture. While the first PC captures more than 25% of the variation and the second PC still captures close to 15%, the numbers quickly drop and then fall in a slow but consistent manner. This creates the possibility to determine the optimal number of components via the elbow rule. According to this guideline, the optimal number of components for the analysis is four to five. A second guideline, the so-called Kaiser rule, advises to include those components with eigenvalues above one. This is visualized via the horizontal line in the second graph. Following this guideline would lead to the use of eight components in the further analysis. The final guideline considered is to add components to the analysis until 90% of the variation is explained.
In the third plot, this threshold is again visualized by a horizontal line. To capture 90% of the variation in the dataset, 18 PCs should be included. Moving forward, the elbow rule guideline will be applied during the exploratory stage, where the focus is on getting a first intuition of patterns in the data rather than dimensionality reduction for further use of the data. Other guidelines will be applied during later stages of the analysis. The largest four principal components are able to capture roughly half of the variation present in the data.
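A minimal sketch computing all three guidelines with scikit-learn; the standardized input matrix is a random stand-in for the paper's continuous variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(2).normal(size=(300, 25))
pca = PCA().fit(StandardScaler().fit_transform(X))

# Scree plot data for the elbow rule: proportion of variance per component.
ratios = pca.explained_variance_ratio_

# Kaiser rule: keep components whose eigenvalue exceeds one.
n_kaiser = int((pca.explained_variance_ > 1).sum())

# Cumulative-variance rule: keep components until 90% of variance is explained.
n_90 = int(np.searchsorted(np.cumsum(ratios), 0.90) + 1)

print(ratios[:5], n_kaiser, n_90)
```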

Supplementary Information X: Clustering Analysis
Like PCA, clustering can be used to develop an intuition of patterns in the data. Unlike the above analysis however, the focus of clustering lies in finding groups of similar observations rather than groups of variables. These groups of similar observations, or clusters, can potentially serve to find typical observations or prototypes. For the goal at hand, this method of clustering and finding prototypes would ideally inform about differences between groups of sustainable and non-sustainable listings. In the following, the use of PCA as a preprocessing step for the clustering analysis is discussed and applied. Options for the number of clusters are evaluated based on the respective sums of distances and silhouette analysis. Lastly, the clusters determined by the most successful method are reviewed.
To avoid the pitfalls of applying geometric distance measures to high-dimensional data, PCA is applied as a preparatory step for K-means clustering. Due to the dimensionality reduction technique's assumptions, this method is again exclusively applied to continuous variables. Since the unsupervised learning task by definition does not have training labels, the optimal number of clusters is determined with the help of two guidelines. Firstly, a plot showing the sum of distances from all observations to the closest cluster center for K = {1,2,…,10} is analyzed. In a similar fashion to principal component analysis, an appropriate number of clusters can be found via the elbow method. However, the data is not easily segmented. Both the analysis using the first four and the analysis using the first eight principal components show no obvious elbow point.
(Figure: Scree plots for K-means clustering using four and eight PCs.)

As a second guideline, silhouette analysis is applied. Like the abovementioned elbow method, this technique employs a distance-based measure to create a numerical basis of comparison for the number of clusters used. In addition to this averaged silhouette score, a visualization of each single observation's score can aid in better understanding groups of observations. The silhouette score is defined as

$$s = \frac{b - a}{\max(a, b)},$$

where $a$ is the mean distance between an observation and every other observation in the cluster it is assigned to, referred to as the mean intra-cluster distance. $b$ on the other hand measures the mean nearest-cluster distance, which averages the distances between an observation and all observations of the closest cluster. The silhouette score returns values between negative one at worst and positive one at best. Intuitively, the cluster assignment is poor when an observation is closer to observations from another cluster than to its peers in the same cluster. Conversely, the denser and better separated clusters are, the larger the total average score will be (Kaufman & Rousseeuw, 2009).

For K-means clustering of the data's first four principal components, the silhouette plots for K = {2,3,4,5} are pictured below. For each K, cluster assignments were calculated with a limit of 100 iterations of the algorithm, selecting the most successful out of ten runs. The diagram shows that the segmentation into two clusters is the only setting in which the average silhouette score, pictured by the dashed red line, is above 0.5. On the other hand, the choice of only two clusters leads to extreme differences in cluster size, with a single cluster containing close to 90% of observations. In addition, all observations assigned to the smaller cluster have silhouette scores below the total average. For segmentations into three, four and five clusters, silhouette scores between 0.265 and 0.283 are found. From these very similar and low scores, it can be inferred that clustering the observations is not an easy task here. The same silhouette analysis was performed for the clustering of observations using the first eight principal components. However, no improvement to the quality of segmentation could be found. Nonetheless, applying clustering to the data can help showcase structure in the data and identify prototype accommodations.
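A minimal sketch of the described silhouette analysis with scikit-learn, using the iteration limit and number of runs stated above; the input data is a random stand-in for the paper's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(5).normal(size=(400, 20))
pcs = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(X))

for k in (2, 3, 4, 5):
    # Limit of 100 iterations, best of ten random initializations, as in the text.
    km = KMeans(n_clusters=k, max_iter=100, n_init=10, random_state=0).fit(pcs)
    avg = silhouette_score(pcs, km.labels_)        # dashed red line in the plots
    per_obs = silhouette_samples(pcs, km.labels_)  # one score per observation
    print(k, round(float(avg), 3), round(float(per_obs.min()), 3))
```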